REMARKS

In the application, Claims 3, 6-20 and 23 are pending and rejected. The Examiner has 
indicated that the amendment filed on March 23, 2004 has been entered but did not place the 
application in condition for allowance. The listing of claims reflects all amendments through the 
March 23, 2004 amendment; no additional amendments have been made. Nonetheless, in 
conjunction with the filing of a Request for Continued Examination, Applicants now submit 
additional remarks in support of patentability. 

The Examiner, relying on the Stoeckert et al. article (Nature Genetics Supplement 32:469,
Dec. 2002) entitled "Microarray databases: standards and ontologies", maintains his position that
the written description does not provide enabling disclosure as required under 35 U.S.C. § 112, 1st
paragraph. Specifically, the Examiner identifies the issue as the "unpredictability as to what use that
such displayed data is applied to." (Office Action dated 11/19/03, p. 3.)

Applicants wish to reiterate that the Stoeckert et al. article is directed to adoption of common 
standards and ontologies that will permit exchange of microarray-based experiment data among 
different research labs. This article expressly addresses microarray databases and cannot be read 
to mean that all gene expression data or the results of analysis of such data are by their nature 
unpredictable as to their application. Nothing in the claims or in the written description limits the
invention to analysis of microarray data, or to gene expression data generated by any other method.
The claims are drawn to gene expression data, regardless of the source. The broad application of 
conclusions made by the authors relative to microarray databases to gene expression data as a whole 
is inappropriate. The point Stoeckert et al. make is that standards should be instituted to permit
sharing of data to avoid inconsistencies across different data formats. While use of different 
terminologies and annotations may make inter-format data sharing difficult, it does not prevent the 
use or analysis of all gene expression data, nor does it impair the application of the results of such 
analyses to a particular purpose. 

Attached hereto as Exhibit A is a reprint of an article published in Nature (Lockhart, D.J. &
Winzeler, E.A., Nature 405:827-836, 15 June 2000), more than two years before the Stoeckert et al.
article, which provides a review of gene expression monitoring and its uses. This article includes
discussion of microarray-based data and its usefulness including measurement of transcriptional
changes as cells progress through normal cell cycle division (p. 829), discovery of gene expression 
markers for classification of cancers (p. 830), identification of functional relationships between genes 
(p. 831), and more. These are all predictable applications of gene expression analysis, even if one 
cannot predict what the exact result will be. One point the authors make at page 829, speaking about 
microarrays, is that "[t]he breadth of array-based observations almost guarantees that surprising
findings will be made." Thus, while not all results may have a predictable usage, this does not make 
them any less useful. These results further the pool of knowledge and that, itself, is useful. The 
Examiner's requirement that the displayed data have a predictable use fails to recognize that there 
is a predictable application in the advancement of knowledge resulting from the comparative analysis 
of gene expression data. 

Applicants wish to point out that the claims are drawn to a method for displaying data on 
gene expression or differences in gene expression. The Examiner's requirement that the 
displayed data have some predictable use exceeds the scope of the claims and the requirements 
for patentability. The invention is drawn to a method for displaying data. The manipulation of 
data to generate a display qualifies as patentable subject matter. (See In re Alappat, 33 F.3d
1526, 31 USPQ2d 1545 (Fed. Cir. 1994) (in banc).) Using the data elements and steps called for
in the claims, a display will predictably be generated. Accordingly, there is nothing 
unpredictable about the claimed invention. 

The claimed invention provides a means for representing (displaying) gene expression 
data. As stated by Lockhart et al., the gene expression data "needs to be more than just stored; it 
needs to be available in a way that helps scientists understand and interpret the often complex 
observations that are becoming increasingly easy to make" (p. 834). The inventive method for 
displaying gene expression data is one means for "allowing connections to be made between 
initially disparate observations and information, and across organisms." (p. 835). In other words, 
different approaches to visualizing the data can be useful in identifying patterns within the data 
that may not have previously been known to exist. While the result of any given observation 
may not be predictable, the result of expressing the observation in a visual display is. It is a tool 
which facilitates recognition of gene expression patterns among different sources and/or under
varied conditions. 

In view of the foregoing remarks, Applicants submit that all bases for rejection have been 
addressed and overcome such that the claims as now presented are enabled and allowable over 
the prior art. Accordingly, Applicants respectfully request that the Examiner withdraw all 
outstanding rejections and issue a notice of allowance for all claims now in the application. 

Genomics, gene expression and
DNA arrays 

David J. Lockhart & Elizabeth A. Winzeler 

Genomics Institute of the Novartis Research Foundation, 3115 Merryfield Row, San Diego, California 92121, USA



Experimental genomics in combination with the growing body of sequence information promise to 
revolutionize the way cells and cellular processes are studied. Information on genomic sequence can be used 
experimentally with high-density DNA arrays that allow complex mixtures of RNA and DNA to be interrogated 
in a parallel and quantitative fashion. DNA arrays can be used for many different purposes, most prominently 
to measure levels of gene expression (messenger RNA abundance) for tens of thousands of genes 
simultaneously. Measurements of gene expression and other applications of arrays embody much of what is 
implied by the term 'genomics'; they are broad in scope, large in scale, and take advantage of all available 
sequence information for experimental design and data interpretation in pursuit of biological understanding. 



Biological and biomedical research is in the 
midst of a significant transition that is being 
driven by two primary factors: the massive 
increase in the amount of DNA sequence 
information and the development of 
technologies to exploit its use. Consequently, we find 
ourselves at a time when new types of experiments are 
possible, and observations, analyses and discoveries are 
being made on an unprecedented scale. Over the past few 
years, more than 30 organisms have had their genomes 
completely sequenced, with another 100 or so in progress
(see www.tigr.org or genomes@ncbi.nlm.nih.gov for 
a list). At least partial sequence has been obtained for 
tens of thousands of mouse, rat and human genes, and 
the sequence of two entire human chromosomes 
(chromosomes 21 and 22) has been determined 1,2 . Within 
the year, a large proportion of the human genome will be 
deciphered, in both public and private efforts, and the 
complete sequence of the mouse and other animal and 
plant genomes will undoubtedly follow close behind. 
Unfortunately, the billions of bases of DNA sequence do 
not tell us what all the genes do, how cells work, how cells 
form organisms, what goes wrong in disease, how we age 
or how to develop a drug. This is where functional 
genomics comes into play. The purpose of genomics is to 
understand biology, not simply to identify the component 
parts, and the experimental and computational methods 
take advantage of as much sequence information as 
possible. In this sense, functional genomics is less a specific 
project or programme than it is a mindset and general 
approach to problems. The goal is not simply to provide a 
catalogue of all the genes and information about their 
functions, but to understand how the components work 
together to comprise functioning cells and organisms. 

To take full advantage of the large and rapidly increasing 
body of sequence information, new technologies are 
required. Among the most powerful and versatile tools for 
genomics are high-density arrays of oligonucleotides or com- 
plementary DNAs. Nucleic acid arrays work by hybridization 
of labelled RNA or DNA in solution to DNA molecules 
attached at specific locations on a surface. The hybridization 
of a sample to an array is, in effect, a highly parallel search by 
each molecule for a matching partner on an 'affinity matrix',
with the eventual pairings of molecules on the surface
determined by the rules of molecular recognition. Arrays of 
nucleic acids have been used for biological experiments for 
many years 3-8 . Traditionally, the arrays consisted of fragments 
of DNA, often with unknown sequence, spotted on a porous 
membrane (usually nylon). The arrayed DNA fragments 
often came from cDNA, genomic DNA or plasmid libraries, 
and the hybridized material was often labelled with a radioac- 
tive group. Recently, the use of glass as a substrate and fluores- 
cence for detection, together with the development of new 
technologies for synthesizing or depositing nucleic acids on 
glass slides at very high densities, have allowed the miniaturization
of nucleic acid arrays with concomitant increases in
experimental efficiency and information content 9-14 (Fig. 1).

While making arrays with more than several hundred 
elements was until recently a significant technical 
achievement, arrays with more than 250,000 different 
oligonucleotide probes or 10,000 different cDNAs per 
square centimetre can now be produced in significant 
numbers 15,16 . Although it is possible to synthesize or deposit
DNA fragments of unknown sequence, the most common 
implementation is to design arrays based on specific 
sequence information, a process sometimes referred to as 
'downloading the genome onto a chip' (Fig. 1). There are 
several variations on this basic technical theme: the 
hybridization reaction may be driven (for example, by an 
electric field) 17,18 ; other detection methods 19 besides fluorescence
can be used; and the surface may be made of materials
other than glass such as plastic, silicon, gold, a gel or
membrane, or may even be comprised of beads at the ends of
fibre-optic bundles 20-22 . Nonetheless, the key elements of
parallel hybridization to localized, surface-bound nucleic 
acid probes and subsequent counting of bound molecules 
are ubiquitous, and high-density arrays of nucleic acids on 
glass (often called DNA microarrays, oligonucleotide 
arrays, GeneChip arrays, or simply 'chips') and their 
biological uses will be the focus of this review. 

Global gene expression experiments 

One of the most important applications for arrays so far is the 
monitoring of gene expression (mRNA abundance). The col- 
lection of genes that are expressed or transcribed from 
genomic DNA, sometimes referred to as the expression 
Figure 1 Principal types of arrays used in gene expression monitoring. Nucleic acid
arrays are generally produced in one of two ways: by robotic deposition of nucleic acids 
(PCR products, plasmids or oligonucleotides) onto a glass slide 25 or in situ synthesis 
(using photolithography 15 ) of oligonucleotides. Shown are pseudocolour images of 
a, an oligonucleotide array and b, a cDNA array after hybridization of labelled samples, 
and fluorescence detection. In both cases the images have been coloured to indicate
the relative number of yeast transcripts present under two different growth conditions
(red, high in condition 1, low in condition 2; green, high in condition 2, low in condition
1; yellow, high under both conditions; black, low under both conditions). In the case of
photolithographically synthesized arrays, ~10^7 copies of each selected oligonucleotide
(usually 20 to 25 nucleotides in length) are synthesized base by base in hundreds of
thousands of different 24 µm × 24 µm areas on a 1.28 cm × 1.28 cm glass
surface. For robotic deposition, approximately one nanogram of material is deposited
at intervals of 100-300 µm. Typically for oligonucleotide arrays, multiple probes per
gene are placed on the array (20 pairs in the example shown here), while in the case of
robotic deposition, a single, longer (up to 1,000 bp) double-stranded DNA probe is
used for each gene or EST. In both cases, probes are usually designed from sequence
located nearer to the 3' end of the gene (near the poly-A tail in eukaryotic mRNA), and
different probes can be used for different exons. After hybridization of labelled samples
(typically overnight), the arrays are scanned and the quantitative fluorescence image
along with the known identity of the probes is used to assess the 'presence' or
'absence' (more precisely, the detectability above thresholds based on background and
noise levels) of a particular molecule (such as a transcript), and its relative abundance
in one or more samples. Because the sequence of the oligonucleotide or cDNA at each
physical location (or address) is generally known or can be determined, and because
the recognition rules that govern hybridization are well understood, the signal intensity
at each position gives not only a measure of the number of molecules bound, but also
the likely identity of the molecules. Although oligonucleotide probes vary systematically
in their hybridization efficiency, quantitative estimates of the number of transcripts per 
cell can be obtained directly by averaging the signal from multiple probes 1 St28 ^. For 
technical reasons, the information obtained from spotted cDNA arrays gives the relative
concentration (ratio) of a given transcript in two different samples (derived from
competitive, two-colour hybridizations). Messenger RNAs present at a few copies
(relative abundance of ~1:100,000 or less) to thousands of copies per mammalian cell
can be detected 25,26,30 , and changes as subtle as a factor of 1.3 to 2 can be reliably
detected if replicate experiments are performed. c, Different methods for preparing
labelled material for measurements of gene expression. The RNA can be labelled 
directly, using a psoralen-biotin derivative or by ligation to an RNA molecule carrying 
biotin 26 ; labelled nucleotides can be incorporated into cDNA during or after reverse 
transcription of polyadenylated RNA; or cDNA can be generated that carries a T7
promoter at its 5' end. In the last case, the double-stranded cDNA serves as template 
for a reverse transcription reaction in which labelled nucleotides are incorporated into 
cRNA. Commonly used labels include the fluorophores fluorescein, Cy3 (or Cy5), or 
nonfluorescent biotin, which is subsequently labelled by staining with a fluorescent 
streptavidin conjugate. d, Two-colour hybridization strategy often used with cDNA
microarrays. cDNA from two different conditions is labelled with two different 
fluorescent dyes (usually Cy3 and Cy5), and the two samples are co-hybridized to an 
array. After washing, the array is scanned at two different wavelengths to detect the 
relative transcript abundance for each condition. cDNA array image courtesy of J.
DeRisi and P. O. Brown (http://cmgm.stanford.edu/pbrown/yeastchip.html).
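
As a rough illustration of the two-colour readout described in panel d, the relative abundance of a transcript at one spot can be expressed as a log ratio of the two channel intensities. The sketch below is not taken from the article: the function names, the intensity floor, the twofold threshold and the sample values are all illustrative assumptions.

```python
import math

def log2_ratio(sample_a, sample_b, floor=1.0):
    """log2 of the intensity ratio A/B for one spot.

    `floor` (an assumed value) clamps very low signals so the
    ratio stays finite when a channel is near background.
    """
    return math.log2(max(sample_a, floor) / max(sample_b, floor))

def call_spot(sample_a, sample_b, fold=2.0):
    """Label a spot using an assumed fold-change threshold."""
    ratio = log2_ratio(sample_a, sample_b)
    cutoff = math.log2(fold)
    if ratio >= cutoff:
        return "higher in A"
    if ratio <= -cutoff:
        return "higher in B"
    return "unchanged"

# Hypothetical background-corrected intensities for three spots.
spots = {"gene1": (400.0, 100.0), "gene2": (90.0, 100.0), "gene3": (50.0, 800.0)}
for name, (a, b) in spots.items():
    print(name, round(log2_ratio(a, b), 2), call_spot(a, b))
```

Working on a log scale treats a twofold increase and a twofold decrease symmetrically, which is one common reason for preferring it over the raw ratio.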
Figure 2 Messenger RNA abundance levels in different
cells, tissues and organisms. a, Human HIV-infected T
lymphocytes; b, mouse olfactory epithelium; c, rat brain;
d, S. cerevisiae strain RY136 grown at 25 °C in rich medium.
Levels of gene expression were measured using Affymetrix
oligonucleotide arrays. For human, mouse and rat samples, 
hybridization intensities were converted to copies per cell (top 
axis) based on the signal from multiple control RNAs added to 
the samples at known concentrations. For yeast, the 
conversion was based on the signal from the TATA-binding
protein (TBP) mRNA, which has been determined to be
present at ~3.5 copies per cell when yeast cells are grown in
rich medium 103 . Only those genes scored as 'present' are 
represented in the histograms. Data from multiple arrays 
containing probes for a different subset of genes and ESTs 
were combined to generate the plots for human (five arrays), 
mouse (five arrays) and rat (three arrays). All yeast ORFs were 
represented on a single array. For measurements that cover 
such a large number of genes, it is important to maintain high 
standards of data quality to keep false-positive results to a 
minimum. (For example, when monitoring 10,000 genes, 
even a low false-positive rate of 1% results in 100 false calls.)
We find that the source of most false positives (in large part
the result of setting the lowest possible thresholds in the
interest of sensitivity) is random noise, biological variation, or
the occasional array-specific physical defect, so observations
made consistently in independent replicates yield a
false-positive rate close to 0.01%, or only 1 in 10,000. In well controlled experiments involving specific biochemical, chemical and genetic perturbations, typically the number of
expression differences is modest, with about 0.1-2% of the monitored genes changing by a factor of 1.8 or more, and only a small fraction of these changing by more than four- to
fivefold 56-58,70-72,95,104 . For samples derived, for example, from different adult human or mouse tissues, or from normal versus advanced tumour tissue, the number of differences
can be as large as 10-15% of the monitored genes 5 ** 3 . The larger number of differences poses only minor difficulties for the technology, but analysis of the more complex results
and the larger number of genes involved typically requires more sophisticated computational methods. 
profile or the 'transcriptome', is a major determinant of cellular phenotype
and function. The transcription of genomic DNA to produce
mRNA is the first step in the process of protein synthesis, and 
differences in gene expression are responsible for both morphological 
and phenotypic differences as well as indicative of cellular responses to 
environmental stimuli and perturbations. Unlike the genome, the 
transcriptome is highly dynamic and changes rapidly and dramatically 
in response to perturbations or even during normal cellular events 
such as DNA replication and cell division 23,24 . In terms of understand- 
ing the function of genes, knowing when, where and to what extent a 
gene is expressed is central to understanding the activity and biological 
roles of its encoded protein. In addition, changes in the multi-gene 
patterns of expression can provide clues about regulatory mechanisms 
and broader cellular functions and biochemical pathways. In the 
context of human health and treatment, the knowledge gained from 
these types of measurements can help determine the causes and conse- 
quences of disease, how drugs and drug candidates work in cells 
and organisms, and what gene products might have therapeutic uses 
themselves or may be appropriate targets for therapeutic intervention.

Past discussions of arrays have often centred on technical issues and 
specific performance characteristics 25 . Now that nucleic acid arrays have
been constructed for many different organisms 14,26-29 and used successfully
to measure transcript abundance in a host of different experiments,
the focus of interest has thankfully shifted. Investigators are now
more concerned with questions concerning experimental design, data
analysis, the use of small amounts of mRNA from limited sources, the
best ways to extract biological meaning from the results, pathway and 
cell-circuitry modelling, and medical uses of expression patterns. 

Array-based gene expression monitoring 

One way to think of measurements with arrays is that they are simply 
a more powerful substitute for conventional methods of evaluating 
mRNA abundance. For some early experiments, only a relatively
small set of genes, which were thought to be important to a process,
were included on the arrays 12,30 . However, such experiments did not
capitalize on the arrays' potential: a key advantage of using arrays,
especially those that contain probes for tens of thousands of different 
genes, is that it is not necessary to guess what the important genes or 
mechanisms are in advance. Instead of looking only under the 
proverbial lamppost, a broader, more complete and less biased view 
of the cellular response is obtained (Figs 2, 3).

The breadth of array-based observations almost guarantees that 
surprising findings will be made. A recent study measured the
transcriptional changes that occur as cells progress through the
normal cell-division cycle in humans for approximately 40,000 genes
(R. J. Cho et al., unpublished results). In addition to the induction of
DNA replication genes and genes involved with cell-cycle control and 
chromosome segregation that would be expected at specific stages in 
the cell cycle, a large collection of genes involved with smooth muscle 
function, apoptosis and intercellular adhesion and cell motility were 
found to be upregulated during a specific phase. The expected results 
act effectively as internal controls that provide a certain amount of 
validation (and comfort), while new information is obtained by a 
systematic search of a larger part of 'gene space'. In addition, because 
arrays often contain probes for genes of unknown function (and 
often with only partial sequence information), any outcome for these 
could be considered, in some sense, both surprising and novel 
(although clearly requiring further characterization). 

Other gene expression methods 

Not surprisingly, there are other ways to measure mRNA abundance, 
gene expression and changes in gene expression. For measuring gene 
expression at the level of mRNA, northern blots, polymerase chain 
reaction after reverse transcription of RNA (RT-PCR), nuclease 
Figure 3 Methods for analysing gene
expression data shown for measurements of
expression in the cell cycle of S. cerevisiae.
a, Yeast cells were synchronized and cells
were collected every ten minutes throughout
two complete synchronous cycles (18 time
points in total are shown). Expression data 
were collected by hybridizing labelled cDNA 
samples to high-density oligonucleotide 
arrays. Transcript levels were determined for 
almost every gene in the genome for every 
time point 24 . A sample of 409 genes (from a 
total of 6,000) that showed both a significant 
(more than twofold) fluctuation in transcript 
levels during the time course and cell
cycle-dependent periodicity were selected for
further analysis. b, Dendrogram indicating
similarity of expression profiles, calculated 
using the Pearson correlation function in the 
GeneSpring software package (Silicon 
Genetics, San Carlos, CA). For display
purposes, the relative expression levels were
plotted in red (high) and blue (low). c, The
genes were divided into five different temporal
expression classes (red, early G1; light blue,
G1; green, late G1; dark blue, S; orange,
G2/M) using K-tuple means clustering (also
using GeneSpring software) and the clusters were named according to their time of peak expression within the cell cycle. d, Line graphs for all genes in the clusters defined in b.
e, Location of cell cycle-regulated genes within the dendrogram in a that have cis-regulatory sequence elements in the 500 bp upstream of their promoter. Column 1, MCB sites
(ACGCGT); column 2, ECB sites (TTWCCCNNNNAGGAA); column 3, a new sequence (GTAAACAA or TTGTTTAC) was identified that was statistically associated (p = 1.77 × 10^-7
for the forward direction, p = 0.003 for the reverse) with the promoter regions of genes whose expression peaked in G2/M phase.
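
The similarity measure named in panel b, and the twofold selection criterion stated in panel a, can be sketched in a few lines. The code below is an independent illustration and not the GeneSpring implementation; the function names and the toy time courses are assumptions, and transcript levels are assumed to be strictly positive.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # flat profile: correlation is undefined, treated as 0 here
    return cov / math.sqrt(var_x * var_y)

def fluctuates(profile, fold=2.0):
    """True if the max/min transcript level over the time course exceeds `fold`."""
    return max(profile) / min(profile) > fold

# Hypothetical transcript levels at five time points for three genes.
profiles = {
    "geneA": [10.0, 40.0, 12.0, 38.0, 11.0],
    "geneB": [20.0, 85.0, 22.0, 80.0, 25.0],
    "geneC": [30.0, 31.0, 29.0, 30.0, 32.0],
}
selected = [g for g, p in profiles.items() if fluctuates(p)]
print(selected)
print(round(pearson(profiles["geneA"], profiles["geneB"]), 3))
```

A clustering method would then group genes whose pairwise correlation is high; a common choice of distance for that purpose is 1 minus the correlation.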
protection, cDNA sequencing, clone hybridization, differential
display 31 , subtractive hybridization, cDNA fragment fingerprinting 32-35
and serial analysis of gene expression (SAGE) 36 have all been put to good 
use to measure the expression levels of specific genes, characterize 
global expression profiles or to screen for significant differences in 
mRNA abundance. But if messenger RNA is only an intermediate on 
the way to production of the functional protein products, why measure 
mRNA at all? One reason is simply that protein-based approaches are 
generally more difficult, less sensitive and have a lower throughput than 
RNA-based ones. But more importantly, mRNA levels are immensely 
informative about cell state and the activity of genes, and for most 
genes, changes in mRNA abundance are related to changes in protein 
abundance. Because of its importance, however, many methods have 
been developed for monitoring protein levels either directly or 
indirectly (see review in this issue by Pandey and Mann, pages 
837-846). These include western blots, two-dimensional gels, methods 
based on protein or peptide chromatographic separation and mass
spectrometric detection 37-40 , methods that use specific protein-fusion
reporter constructs and colorimetric readouts 41-44 , and methods based
on characterization of actively translated, polysomal mRNA 45-47 . 

The importance of the protein-based methods is that they measure 
the final expression product rather than an intermediate. In addition, 
some of them enable the detection of post-translational protein modifi- 
cations (for example, phosphorylation and glycosylation) and protein 
complexes, and in some cases, yield information about protein localiza- 
tion, none of which are obtained directly by measurements of mRNA. 
There is no question that protein- and RNA-based measurements are 
complementary, and that protein-based methods are important as they 
measure observables that are not readily detected in other ways. 

Human disease, gene expression and discovery 

Genomics and gene expression experiments are sometimes derided 
as 'fishing expeditions'. Our view is that there is nothing wrong with a
fishing expedition 48 if what you are after is 'fish', such as new genes
involved in a pathway, potential drug targets or expression markers 
that can be used in a predictive or diagnostic fashion. Because the 
arrays can be designed and made on the basis of only partial 
sequence information, it is possible to include genes in a survey that 
are completely uncharacterized. In many ways, the spirit of this 
approach is more akin to that of classical genetics in which mutations
are made broadly and at random (not only in specific genes),
and screens or selections are set up to discover mutants with an 
interesting phenotype, which then leads to further characterization 
of specific genes. 

Such broad discovery experiments are probably better described 
as 'question-driven' rather than hypothesis-driven in the conventional
sense. But that is not to diminish their value for understanding
basic biological processes and even for understanding and treating 
human disease. For example, by analysing multiple samples obtained 
from individuals with and without acute leukaemia or diffuse large 
B-cell lymphoma, gene expression (mRNA) markers were discov- 
ered that could be used in the classification of these cancers 49,50 . The 
importance of monitoring a large number of genes was well illustrated
in these studies. Golub et al. 49 found that reliable predictions could
not be made based on any single gene, but that predictions based on
the expression levels of 50 genes (selected from the more than 6,000
monitored on the arrays) were highly accurate. The results of both of
these studies indicate that measurements with more individuals and 
more genes will be needed to identify robust expression markers that 
are predictive of clinical outcome. But even with the limited initial 
data it was possible to help clarify an unusual case (classic leukaemia 
presentation but atypical morphology) and to use this information 
to guide the patient's clinical care. 
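
To make concrete why a prediction pooled over several genes can be more reliable than one based on any single gene, the sketch below lets each marker gene cast one vote for the class whose mean expression it most resembles. This is a deliberately simplified stand-in and should not be read as the procedure used in the cited studies; the class labels, function names and expression values are all hypothetical.

```python
def class_means(samples, labels):
    """Per-gene mean expression for each class label."""
    grouped = {}
    for profile, label in zip(samples, labels):
        grouped.setdefault(label, []).append(profile)
    return {
        label: [sum(column) / len(column) for column in zip(*profiles)]
        for label, profiles in grouped.items()
    }

def predict(profile, means):
    """Each gene votes for the class with the nearest mean; the majority wins."""
    votes = {}
    for gene, value in enumerate(profile):
        nearest = min(means, key=lambda label: abs(value - means[label][gene]))
        votes[nearest] = votes.get(nearest, 0) + 1
    return max(votes, key=votes.get)

# Hypothetical training data: four samples, three marker genes.
train = [[5.0, 1.0, 4.0], [6.0, 2.0, 1.0], [1.0, 5.0, 1.5], [2.0, 6.0, 4.5]]
labels = ["class1", "class1", "class2", "class2"]
means = class_means(train, labels)

# The third gene alone points to the wrong class, but the vote does not.
print(predict([5.0, 2.0, 3.2], means))
```

In this toy case one gene is uninformative on its own, yet the combined vote still recovers the expected class, which is the general point made in the text above.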

It is also possible to take a related approach to help understand 
what goes wrong in cancerous, transformed cells and to identify 
the genes responsible for disease. Causative effects and potential 
therapeutic targets can be identified by determining which genes are
upregulated in different tumour types 51-55 , and specific candidate
genes can be intentionally overexpressed in cell lines or cells treated
with growth factors in order to identify downstream target genes
and to explore signalling pathways 56-58 . Tumorigenesis is often
accompanied by changes in chromosomal DNA, such as genetic 
rearrangements, amplifications or losses of particular chromosomal 
loci, and developmental abnormalities, such as Down's or Turner's 
syndrome, may arise from aberrations in DNA copy number. 
Because genomic DNA can be interrogated in much the same way as 
mRNA, comparisons of the copy number of genomic regions or the 
genotype of genetic markers can be used to detect chromosomal 
regions and genes that are amplified or deleted in cancerous or 
pre-cancerous cells. By using arrays containing probes for a large 
number of genes or polymorphic markers, changes in DNA copy 
number have been detected in both breast cancer cell lines and in 
tumours 59-61 . The identification of when and where changes in copy
number or chromosomal rearrangements have occurred can be used 
in both the classification of cancer types and the identification of 
regions that may harbour tumour-suppressor genes. 

Whole-genome hypotheses 

The use of genomics tools such as arrays does not, of course, preclude 
hypothesis-driven research. For fully sequenced organisms, arrays 
containing probes for every annotated gene in the genome have been 
produced 14,26 . With these one can ask, for example, whether a 
transcription factor has a global role in transcription (affecting all 
genes) or a specific role (affecting only some). Holstege et al. 62 used
this type of application in a genome-wide expression analysis in yeast
to functionally dissect the machinery of transcription initiation. 
Similarly, genes located near the ends of chromosomes in yeast (as 
well as genes at the mating-type locus) are known to be transcription- 
ally 'silent'. Full genome arrays allow the chromosomal landscape of 
silencing to be mapped, and make it possible to test whether what is 
true for a handful of well-studied genes near the telomeres is true for 
all telomeric genes, and whether any centromere-proximal genes are 
also transcriptionally silenced 63 . 

It is important to emphasize that these new, parallel approaches 
do not replace conventional methods. Standard methods such as 
northern blots, western blots or RT-PCR are simply used in a more 
targeted fashion to complement the broader measurements and to 
follow-up on the genes, pathways and mechanisms implicated by the 
array results. Because the incidence of false-positive results can be 
made sufficiently low (see Fig. 2), it is not necessary to independently 
confirm every change for the results to be valid and trustworthy, 
especially if conclusions are based on changes in sets of genes rather 
than individual genes. More detailed follow-up is recommended if a 
gene is being chosen, for example, as a drug target, as a candidate for 
population genetics studies, or as the target for the construction of a 
knockout mouse. 

Does gene expression indicate function? 

As additional, uncharacterized open reading frames (ORFs) are 
identified in different organisms by the various genome sequencing 
projects, researchers have begun to ask whether the expression pat- 
tern for a gene can be used to predict the functional role of its protein 
product. An increasingly common approach involves using the gene 
expression behaviour observed over multiple experiments to first 
cluster genes together into groups (see Fig. 3), either by manual 
examination of the data 24 , or by using statistical methods such as self-organizing 
maps 64 , K-tuple means clustering or hierarchical clustering 23,65,66 . The basic 
assumption underlying this approach is that 
genes with similar expression behaviour (for example, increasing 
and decreasing together under similar circumstances) are likely to be 
related functionally. In this way, genes without previous functional 
assignments can be given tentative assignments or assigned a role in a 
biological process based on the known functions of genes in the same 

Figure 4 The 'guilt-by-association' method for assigning gene function. Functional 
distribution (using categories from MIPS: 
http://www.mips.biochem.mpg.de/proj/yeast/catalogues/funcat/index.html) of yeast 
genes whose periodic expression peaked at different times in the yeast cell cycle 
(outer rings) or was constant throughout the cell cycle (inner circle) 24 . A much 
larger fraction of cell cycle-modulated genes is important in DNA synthesis, cell 
growth or cell division. Although there is a strong correlation between distinct 
expression profiles and functional assignments, specific expression behaviour 
should not be taken as sufficient evidence for functional assignment: not all genes 
involved in DNA replication are expressed periodically in the cell cycle, and some 
genes that do not need to be cell cycle-regulated are transcribed in a periodic 
fashion. 



expression cluster (that is, the concept of 'guilt-by-association'). The 
validity of this approach has been demonstrated for many genes in 
Saccharomyces cerevisiae, a simple organism for which the entire 
genomic sequence and the functional roles of approximately 60% of 
the genes are known 24,65,67 (Fig. 4). Although not logically rigorous, 
the utility of the guilt-by-association approach has been demonstrat- 
ed, as genes already known to be related do, in fact, tend to cluster 
together based on their experimentally determined expression pat- 
terns (Fig. 4). The approach is made more systematic and statistically 
sound by calculating the probability that the observed functional 
distribution of differentially expressed genes could have happened by 
chance. The application of statistical rigour is essential to avoid 
overly subjective interpretations of the results based on the predispo- 
sitions, prior knowledge and interests of the individual researcher. 
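
One common way to calculate such a probability is a hypergeometric tail test; the text does not name a specific test, so this choice is an assumption. The Python sketch below gives the chance of seeing at least the observed number of genes from one functional category in a cluster if the cluster had been drawn at random from the genome. The function name and the example counts are hypothetical.

```python
from math import comb

def enrichment_p_value(genome_size, category_size, cluster_size, overlap):
    """Hypergeometric tail probability P(X >= overlap).

    genome_size:   total number of genes considered
    category_size: genes in the genome annotated to the functional category
    cluster_size:  genes in the expression cluster
    overlap:       cluster genes that carry the category annotation
    """
    upper = min(category_size, cluster_size)
    total = comb(genome_size, cluster_size)
    tail = sum(
        comb(category_size, i)
        * comb(genome_size - category_size, cluster_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total

# Invented counts: 10 of 50 clustered genes fall in a category that
# covers 200 of 6,200 genes, where about 1.6 would be expected by chance.
p_value = enrichment_p_value(6200, 200, 50, 10)
```

When many categories are tested against many clusters, the resulting probabilities would also need a correction for multiple testing.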

A tentative functional assignment may not be much more than a 
low-resolution description or general classification. Descriptions of 
this type are similar to those that come out of more classical genetic 
screens and selections, which have provided the vast majority of 
functional annotations to date — they indicate that genes are 
involved with a particular cellular phenotype and that they are likely 

Figure 5 Generic oligonucleotide tag arrays for parallel phenotyping of mutant yeast strains. a, Many S. cerevisiae strains, each carrying a specific deletion of one of the more than 
6,000 ORFs in the yeast genome, have been constructed 91 by replacing individual genes with an antibiotic resistance cassette and a unique gene-specific 20-mer 'barcode', 
represented by an X. b, The barcode for each deletion strain corresponds to a specific location on an array that contains oligonucleotide probes that are complementary to the 
barcode sequences. c, Pools of different yeast strains can be assembled and grown under different conditions. After competitive growth, PCR is used to amplify the barcodes from 
genomic DNA isolated from the pools; the PCR products are subsequently labelled. d, By comparing the hybridization patterns of two different pools (before and after treatment with 
a drug, for example), the fitness of the strains can be assessed quantitatively. In this case, yeast genes required for sporulation or germination are represented in red, whereas yeast 
genes that are unnecessary for the process are shown in yellow. These same 20-mer sequences and the accompanying arrays are generic in design, and can be used to read the 
results of different types of 'bar-coded' reactions, such as those used for genotyping of human polymorphic loci 105 . Images provided by R. M. Williams and R. W. Davis. 



to be involved with a certain set of other genes and processes. This 
allows researchers to focus attention on a smaller subset of genes, 
many of which may not have been obvious candidates in the absence 
of the global expression observations. This overall approach high- 
lights the importance of functional annotation and careful curation 
of existing sequence, function and knowledge databases (see below). 
Expression results covering thousands or even tens of thousands 
of genes and expressed sequence tags (ESTs) will be only partly 
interpretable given the functional and biological information 
available at the time they are initially generated. Our ability to extract 
knowledge from measurements of global gene expression tends to 
increase with time as additional information becomes available, and 
results can be subjected to further interrogation in the light of new 
information, observations, questions and hypotheses. 

Gene expression and the regulation of transcription 

When information on the complete genome sequence is available, as 
is the case for increasing numbers of small and even larger genomes, 
gene expression data can be used to identify new cis-regulatory 
elements (genomic sequence motifs that are over-represented in the 
genomic DNA in the vicinity of similarly behaving genes) and 
'regulons' (sets of co-regulated genes), the basic units of the underlying 
cellular circuitry (Fig. 3d). In fact, the correlation between the 
presence of specific sequence motifs in promoter regions and gene 
expression patterns may be stronger than the correlation between 
functional categories and gene expression patterns. In yeast studies, 
more than 50% of the genes that are transcribed in a cell cycle-specific 
manner and whose transcript abundance peaks in the G1 
phase of the cell cycle have an MCB (Mlu cell-cycle box) within 500 
base pairs (bp) of their translational start site 24,68,69 . Similar observations 
have been made for yeast genes whose transcription is induced 
during sporulation 67 . In addition, new cis-regulatory elements may 
be revealed by examining classes of co-regulated genes (Fig. 3d). With 
sufficiently large numbers of experimental observations of expression 
behaviour, the boundaries and all functioning sequence variants 
of cis-regulatory elements might be predicted without the need for 
the more conventional approach using site-directed mutagenesis 
('promoter bashing'). The expression-based method will be especially 
valuable in exotic organisms, such as Plasmodium falciparum, the 
causative agent for malaria, for which experimental identification or 
verification of transcription factor binding sites is difficult. 
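
The kind of observation described above can be sketched in a few lines of Python: for each gene, search the 500 bp upstream of the translational start for a motif, then compare the fraction of hits in a co-regulated cluster with the fraction in a background set. The sequences below are short invented strings. Using the six-base core ACGCGT (the MluI recognition site) to stand in for the MCB element is a simplifying assumption, and the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]

def fraction_with_motif(upstream_seqs, motif, window=500):
    """Fraction of genes with `motif` on either strand within `window` bases.

    upstream_seqs: dict mapping gene name -> upstream sequence written 5'->3'
                   and ending immediately before the translational start.
    """
    motif = motif.upper()
    motif_rc = reverse_complement(motif)
    hits = 0
    for seq in upstream_seqs.values():
        region = seq.upper()[-window:]  # the last `window` bases before the start
        if motif in region or motif_rc in region:
            hits += 1
    return hits / len(upstream_seqs)

# Invented toy sequences (far shorter than real promoter regions)
cluster = {"g1": "TTTACGCGTAAA", "g2": "GGGGACGCGTCC", "g3": "ATATATATAT"}
background = {"b1": "ATATAT", "b2": "CCACGCGTGG", "b3": "GGGCCC", "b4": "TTTTTT"}

cluster_fraction = fraction_with_motif(cluster, "ACGCGT")
background_fraction = fraction_with_motif(background, "ACGCGT")
```

A markedly higher fraction in the cluster than in the background would suggest over-representation, although a statistical test is needed before drawing conclusions from real data.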

Gene expression profiles as 'fingerprints' 

An often overlooked aspect of measurements of global gene expres- 
sion is that the sequence or even the origin of the arrayed probes does 
not need to be known to make interesting observations — the 
complex profiles, consisting of thousands of individual observations, 
can serve as transcriptional 'fingerprints'. The fingerprints can be 
used for classification purposes or as tests for relatedness, in a similar 
manner to the way in which DNA fingerprints are used in paternity 
testing. In one example, transcriptional fingerprints have been used 
to determine the target of a drug 70 . The basic idea is that if a drug 
interacts with and inactivates a specific cellular protein, the phenotype 
of the drug-treated cell should be very similar to the phenotype 
of a cell in which the gene encoding the protein has been genetically 
inactivated, usually through mutation. Thus, by comparing the 
expression profile of a drug-treated cell to the profiles of cells in 
which single genes have been individually inactivated, specific 
mutants can be matched to specific drugs, and therefore, targets to 
drugs. In a demonstration of this concept, the gene product of the 
his3 gene was identified correctly as the target of 3-aminotriazole 70 . 
Similarly, profiles have been used in the classification of cancers and 
the classification schemes did not depend on any specific informa- 
tion about the genes involved 49,50 , although that information can be 
used to draw further biological and mechanistic conclusions. Finally, 
expression profiles can be used to classify drugs and their mode of 
action. For example, the functional similarity and specificity of 
different purine analogues have been determined by comparing the 
genome-wide effects on treated yeast, murine and human cells 71,72 . 
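
The matching of a drug-treated profile to mutant profiles can be sketched as a search for the most highly correlated fingerprint. The Python example below uses the Pearson correlation as the similarity measure, which is an assumption (the cited work may have used a different metric). The mutant names and log-ratio values are invented.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def best_matching_mutant(drug_profile, mutant_profiles):
    """Return the mutant whose profile correlates best with the drug profile."""
    scores = {
        name: pearson(drug_profile, profile)
        for name, profile in mutant_profiles.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Invented log expression ratios for five genes (treated or mutant vs control)
drug_profile = [2.0, -1.0, 0.1, 1.5, -0.2]
mutant_profiles = {
    "mutant_A": [1.8, -0.9, 0.0, 1.6, -0.1],
    "mutant_B": [-0.5, 1.2, 0.3, -1.0, 0.8],
    "mutant_C": [0.1, 0.0, -0.2, 0.1, 0.05],
}
best, scores = best_matching_mutant(drug_profile, mutant_profiles)
```

With real data each profile would contain thousands of measurements, and the gene inactivated in the best-matching mutant would be a candidate for the drug target.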

Expression measurements from small amounts of RNA 

An important frontier in the development of gene expression 
technology involves reduction of the required amount of starting 
material. Most array-based expression measurements are done using 
RNA from a million or more cells, and obtaining such a relatively 
large sample is not a problem in many types of studies (for example, 
litres of yeast cells can be grown easily). However, in some cases, it is 
important or even necessary to use fewer cells, as when using a small 
organ from a fly or worm, sorted cells that express a rare marker, or 
laser-capture microdissected 73-75 tumour tissue. Efficient and 
reproducible mRNA amplification methods are required, and there 
are two primary approaches that show significant promise. The first 
is a PCR-based approach that has been used to make single-cell cDNA 
libraries 76-78 . We have found that the amplification is efficient and 
reproducible, but that the relative abundance of the cDNA products 
is not well correlated with the original mRNA levels (D. Giang and 
D. J. Lockhart, unpublished results), although normalization 
and referencing strategies can be used (D. de Graaf and E. Lander, 
personal communication). 

The second approach avoids PCR altogether and uses multiple 
rounds of linear amplification based on cDNA synthesis and a 
template-directed in vitro transcription (IVT) reaction 79-81 . This 
method has been used to characterize mRNA from single live 
neurons 81 and even subcellular regions, and more recently to amplify 
mRNA from 500 to 1,000 cells from microdissected brain tissues for 
hybridization to spotted cDNA arrays 82 . We have found that the 
multiple-round cDNA/IVT amplification method produces suffi- 
cient quantities of labelled material starting with as little as 1-50 ng 
total RNA, is highly reproducible (correlation coefficients greater 
than 0.97), and introduces much less quantitative bias than 
PCR-based amplification (D. Giang and D. J. Lockhart, unpublished 
results). These amplification methods make it possible to monitor 
large numbers of genes starting with very limited 
amounts of RNA and very few cells. The combination of arrays 
and powerful amplification strategies promises to be especially 
important for studies that use human biopsy material from 
inhomogeneous tissue, and in the areas of developmental biology, 
immunology and neurobiology. 

Genome analysis using arrays 

Although nucleic acid arrays are often equated with gene expression 
analysis, they may be used to collect much of the data that are 
obtained presently by Southern or northern blot hybridization tech- 
niques, but in a more highly parallel fashion (Figs 5,6). Their utility 
in polymorphism detection and genotyping is described elsewhere 
(see review in this issue by Roses, pages 857-865), but there are many 
additional uses for these versatile tools. For example, genomic DNA 
samples can be manipulated experimentally to select for particular 
regions before hybridization to obtain specific types of information. 
In yeast, the location of hundreds of chromosomal origins of replication 
can be determined in parallel by enriching for early-replicating 
regions using a variation of the Meselson-Stahl procedure and then 
hybridizing the resulting DNA to full genome arrays (E. A. Winzeler 
et al., unpublished results). Similarly, as probes for more intergenic 
regions are synthesized on arrays, it becomes possible to identify 
protein-binding sites: fragmented chromatin can be crosslinked to a 
protein and then immunoprecipitated with an antibody to that pro- 
tein. The DNA fraction of the immunoprecipitate can be labelled and 
hybridized to identify the approximate location of the binding site. In 
addition, full genome arrays can be used in the analysis of plasmid 
libraries in genetic selections such as two-hybrid screens 83 or, in 
principle, for any other type of experiment in which the information 
is contained in the form of RNA or DNA. Arrays also have 
applications in biophysical chemistry and biochemistry. For 
example, single-stranded DNA arrays were converted enzymatically 
into arrays of double-stranded DNA to characterize the interactions 
of proteins, and potentially other types of molecules, with double- 
stranded DNA 84 . 

Gene expression and cell circuitry 

Is it reasonable to consider the cell as a complex analogue circuit, and 
to attempt to reverse-engineer the cell circuitry much like an electri- 
cal engineer would do by measuring currents and voltages at a variety 
of nodes and under a variety of input conditions? In the case of 
the cell, expression levels and expression changes might take the place 
of electrical measurements, and could be measured under many 
experimental conditions. Is it possible that a genetic or cellular circuit 
of reasonable complexity could be adequately decoded or modelled, 
and if so, how many and what types of measurements and perturbations 
(or 'inputs') would be required so that the problem was not 
hopelessly underdetermined 85-89 ? Reasonably detailed circuit 
diagrams can be drawn and simulations of simple genetic circuits 
have been performed for systems of low complexity (for example, the 
lytic cycle of phage lambda, and simple control networks in 
Escherichia coli bacteria 90 ). But the situation is considerably more 
complex in the case of a eukaryotic cell. Using yeast as an example, if 
we assume that the expression level for each gene can be one of only 
four levels (off, low, medium or high), then if the 6,200 yeast genes 
behave independently, there are 6,200^4, or ~1.5 x 10^15 possible 
expression states. Of course, the expression levels of different genes 
are not all independent of one another, and there are some states that 
are physically unrealistic (for example, all genes 'off' or all genes 
'high'), but the number of possible cellular configurations is very 
large. In addition, coupling between circuit components, the effects 
of nonlinear feedback, redundancy and even noise and stochastic 
events make simulating a circuit of this complexity a rather daunting 
task, and not all relationships and cellular events are reflected at the 
level of mRNA abundance. 
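
The figure of about 1.5 x 10^15 quoted above equals 6,200 raised to the fourth power. As a side calculation that is not part of the original text: if each of 6,200 independent genes could take any of four levels, the number of joint configurations would be 4 raised to the 6,200th power, which is far larger still. Either way, the point that the space of possible states is very large stands. The short Python check below works through both numbers; the variable names are illustrative.

```python
import math

genes = 6200   # approximate number of yeast genes, as given in the text
levels = 4     # off, low, medium or high

# The quantity that matches the quoted figure: 6,200^4
quoted_count = genes ** levels

# Joint states if every gene independently takes one of four levels: 4^6,200.
# The number is too long to print usefully, so report its base-10 exponent.
log10_independent = genes * math.log10(levels)

print(f"6,200^4 = {quoted_count:.2e}")
print(f"4^6,200 is roughly 10^{log10_independent:.0f}")
```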

Least clear may be what types of perturbations or inputs are likely 
to be the most informative in terms of defining the relationships 
between genes and pathways, and what might be a minimal set of 
'orthogonal perturbations' (treatments, genetic manipulations or 
growth conditions that have minimal overlap in their direct cellular 
effects). Certainly it is possible to delete every yeast gene one at a time 
(or even several at a time) and measure the expression profile for each 
mutant strain under a set of different growth conditions 70,91 . It is 
also possible to grow yeast on a matrix of thousands of different 
conditions and measure the resulting expression profiles for a range 
of mutated strains. It is clear that extensive experiments of this type, 
combined with information from other measurements such as 
yeast two-hybrid protein-protein interaction screens 92 , and 
measurements of protein levels, modification states and cellular 
localization will lead to useful groupings of genes in terms of function 
and regulation (that is, a genetic, molecular and functional taxonomy), 
and will supply some reasonably detailed information about the 
relationships between certain genes and pathways. In addition, sets 
of perturbations directed towards specific functions and 
cellular processes will allow higher-resolution and even mechanistic 

Figure 6 Comparative genome hybridization 
using arrays 28,106,107 . a, Two arrays containing 
probes to yeast (the complete genome 
sequence of S. cerevisiae strain S288c and 
some S. cerevisiae DNA not present in S288c) 
were hybridized with fragmented, labelled 
genomic DNA from two different yeast strains 
commonly used in genetic studies (W303 and 
SK1). Red indicates the location of probes that 
hybridize efficiently only to DNA from the 
W303 strain, green indicates probes that 
hybridize only to SK1 DNA, and yellow 
indicates probes that hybridize equally to the 
DNA from both strains. b, Enlargement of the 
boxed region in a. c, Region of the array 
containing probes to relatively unique protein- 
coding regions of the genome. d, Probes to 
non-unique regions of the genome 
(transposable elements, telomeric sequences, 
transfer RNAs and ribosomal RNAs). Genome 
regions that are present, absent, or found at 
higher or lower copy numbers in the two 
strains are readily detected. The large amount 
of allelic variation between the strains can be 
used in mapping studies 108 . Related 
approaches can be used in typing microbial 
isolates 29,109 or to identify genetic 
abnormalities in tumours. 



information for significant parts of the overall circuitry 62,93 . However, 
given the tremendous complexity of the system, it is unlikely that a 
complete and detailed cellular circuit diagram will result for even 
single-celled eukaryotes such as yeast anytime in the near future. But 
that is not to say that construction of even first-order global models 
and semi-quantitative circuit diagrams is not extremely useful. Such 
models serve to organize current information, relationships and 
hypotheses, and can be tremendously helpful for testing new 
hypotheses, interpreting new observations, designing new experi- 
ments and predicting the likely effects of particular chemical, genetic 
or cellular perturbations. They also serve as a scaffold upon which to 
build higher-resolution, more quantitative and complete models. 

Can we have too much data? 

Contrary to what is sometimes thought, the biggest problem for 
making sense of the extensive results from genomics experiments is 
not that there is too much data or that there are insufficiently sophis- 
ticated algorithms and software tools for querying and visualizing 
data on this scale. Larger problems of data management and analysis 
have been solved by airlines, financial institutions, global retailers, 
high-energy and plasma physicists, the military and global weather 
predictors, among others. It is often beneficial to have a large number 
of measurements 94 and sometimes more data make it possible to 
analyse results that might otherwise have been too 'messy', and to 
detect patterns and relationships that would not have been obvious 
or have sufficient statistical significance with smaller data sets. In 
many types of studies, it is not possible to control completely all 
variables, and the individual differences between common sample 
types may be significant because of experimental difficulties (for 
example, tissue inhomogeneity or variations in sample procedures) 
or individual genetic variation (for example, different patients or dif- 
ferent tumours). But such factors do not preclude the discovery of 
some genes that clearly 'cluster' or differentiate between the sample 
sets. For example, meaningful results can be extracted from the 
analysis of human tissue collected at different hospitals, by different 
surgeons and at different times. An essential requirement in these 



types of studies is that a sufficient number of experiments be 
performed across multiple individuals and multiple tissue or tumour 
samples to account for individual variation and possible tissue 
inhomogeneity. Furthermore, confidence in the results is increased 
as conclusions are based on sets of genes that show a consistent 
response and that are consistently different between two or more sets 
of results 49,50,52,53,95 . 

Making sense of genomic results 

Although the difficulties of sample collection, data collection and 
experimental design should not be underestimated, one of the most 
challenging aspects of gene expression analysis is making sense of the 
vast quantities of data and extracting conclusions and hypotheses that 
are biologically meaningful. From experiments on global gene expres- 
sion, we may obtain data for thousands of genes, often forcing us to 
consider processes, functions and mechanisms about which we know 
very little. Thus, there is a need for more sophisticated systems of 
knowledge representation (or 'knowledge bases') that organize the 
data, facts, observations, relationships and even hypotheses that form 
the basis of our current scientific understanding. This information 
needs to be more than just stored; it needs to be available in a 
way that helps scientists understand and interpret the often 
complex observations that are becoming increasingly easy to make. 
Unfortunately, the fact is that the scientific literature has been 
somewhat haphazardly built, without the benefit of a controlled or 
restricted vocabulary and well-defined semantics and grammar. To 
take full advantage of the abilities of the new technologies and the 
rapidly increasing amount of sequence information it is absolutely 
essential to incorporate the facts, ideas, connections, observations 
and so forth, which exist in the scientific literature and in the 
minds of scientists, into a form that is systematic, organized, 
linked, visualized and searchable. This clearly requires a great 
deal of dedicated, systematic human effort, but progress has 
been made. Databases such as the Saccharomyces Genome 
Database (SGD: genome-www.stanford.edu/Saccharomyces), the 
Munich Information Center for Protein Sequences (MIPS: 
www.mips.biochem.mpg.de), WormBase (www.wormbase.org), the 
Kyoto Encyclopedia of Genes and Genomes (KEGG: 
www.genome.ad.jp/kegg), the Encyclopedia of E. coli Genes and 
Metabolism (EcoCyc: http://ecocyc.panbio.com/ecocyc) and FlyBase 
(flybase.bio.Indiana.edu/) incorporate sequence, genetics, gene 
expression, homology, regulation, function and phenotype information 
in an organized and useable form 96-102 . But a step beyond databases 
of this type are ones in which concepts as well as facts are more fully 
integrated and related, allowing connections to be made between 
initially disparate observations and information, and across 
organisms. It is conceivable that the next step will evolve to the level of a 
biological 'expert system', not unlike the expert system ('Deep Blue') that 
IBM scientists and engineers built to play chess (successfully) against 
the world's best chess player. Despite the potential for advancement 
on this front, it seems unlikely that computational tools will ever 
replace the trained human brain when it comes to making biological 
sense of new results. However, the appropriate tools are needed to bring 
information and relationships to scientists' fingertips so that the 
most insightful questions can be asked and the most meaningful 
interpretations made. 

Conclusion 

For these array-based methods to become truly revolutionary, they 
must become an integral part of the daily activities of the typical 
molecular biology laboratory. Despite their impressive and rapidly 
growing résumé, these technologies are still in their infancy, with 
plenty of room for technical improvements, further development, 
and more widespread acceptance and accessibility. We expect that the 
pattern of development and use of arrays and other parallel genomic 
methodologies will be similar to that seen for computers and other 
high-tech electronic devices, which started out as exotic and expen- 
sive tools in the hands of the few developers and early adopters, and 
then moved quickly to become easier to use, more available, less 
expensive and more powerful, both individually and because of their 
ubiquity. In fact, nucleic acid array-based methods that previously 
seemed exotic, and too expensive, are becoming routine as indicated 
by the huge increase in the number of publications that incorporate 
data obtained in this way. Despite the relative youth of these 
approaches, the achievement of technical goals that would have 
seemed like science fiction only a few years ago is now clearly in view. 
For example, we expect that measuring the expression level of essen- 
tially every gene (including variant splice forms) on an array or two 
starting with RNA from a small number of cells, or even a single cell, 
will soon be possible owing to advances in single-cell handling and 
RNA amplification methods, the output of large-scale sequencing 
efforts and achievable advances in array technology. In the future, 
arrays of peptides, proteins, small molecules, mRNAs, clones, tissues, 
cells and even multicellular organisms such as the nematode worm 
Caenorhabditis elegans may also become common. The combined 
use of all of these highly parallel methods, along with sequence 
information, computational tools, integrated knowledge databases, 
and the traditional approaches of biology, biochemistry, chemistry, 
physics, mathematics and genetics, increases the hopes of 
understanding the function and regulation of all genes and proteins, 
deciphering the underlying workings of the cell, determining the 
mechanisms of disease, and discovering ways to intervene with or 
prevent aberrant cellular processes in order to improve human health 
and well-being. 